feat: Add tool selection safety scorer and role-play transform #311
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[AIRT] Tool selection safety and role-play testing
Key Changes:
tool_selection_safetyscorer to detect when agents choose dangerous tools over safe alternatives (returns 1.0 for unsafe, 0.5 for questionable, 0.0 for safe)role_play_wrappertransform with 4 scenarios (educational, fictional, historical, research) for jailbreak testingllm_judge: tool_selection_safety, unbounded_agency, web_chatbot_securityAdded:
tool_selection_safety()scorer intool_invocation.pyrole_play_wrapper()transform instylistic.pydreadnode/data/rubrics/Usage:
```python
Tool safety scorer
scorer = dn.scorers.tool_selection_safety(
safe_tools=["http_get", "tcp_connect"],
dangerous_tools=["shell_execute"],
)
Role-play transform
transform = dn.transforms.stylistic.role_play_wrapper(
scenario="educational",
character="security researcher",
)
```
Generated Summary:
Summary of Changes:
Introduced new rubrics for safety evaluation in AI tools:
Added a scoring mechanism for tool selection safety, focusing on:
Implemented a role-play wrapper transform for testing against jailbreak attempts, distinguishing between legitimate educational inquiries and potentially harmful requests.
Included an example Jupyter notebook to demonstrate the usage of both the tool selection safety scorer and the role-play wrapper, with practical scenarios to illustrate scoring and evaluation.
Potential Impact:
This summary was generated with ❤️ by rigging